feat: Add random state feature. #150

john-halloran · 2025-06-06T19:48:37Z

feat: Added random_state feature for reproducibility.

sbillinge

This is great!

We have to decide how much testing we will add. Ideal is 100% coverage, optimal is probably less.

Maybe write the docstrings so I can understand what the class does, then we can decide what to test?

src/diffpy/snmf/snmf_class.py

sbillinge · 2025-06-06T20:59:08Z

src/diffpy/snmf/snmf_class.py

+        MM,
+        Y0=None,
+        X0=None,
+        A=None,


more descriptive name?

There are many different standards for what to name these matrices. Zero agreement between sources that use NMF. I'm inclined to eventually use what sklearn.decomposition.non_negative_factorization uses, which would mean MM->X, X->W, Y->H. But I'd like to leave this as is for the moment until there's a consensus about what would be the most clear or standard. If people will be finding this tool from the sNMF paper, there's also an argument for using the X, Y, and A names because that was used there.

OK, sounds good. There has to be a very good reason to break PEP8. The only good enough reason I can think of is to be consistent with scikit-learn. Another way of saying it is that we can "adopt the scikit-learn standard".

I'm fine with adopting the scikit-learn standard, but I also like the idea of giving them descriptive (and lowercase) names. The argument against that is that some lines of code use A/X/Y 10+ times in quick succession, so it would make the code very verbose.

more readable code is always better, so lower-case descriptive is preferred by me. I don't actually like that scikit-learn breaks this. Shall we go with lower-case? Names can be short if they are defined in a function in the docstring and docs too. Just the code benefits from being readable, so I would say use your judgement on that.

I've started the conversion to lower case, but it's a large enough process (involving many poorly labeled sub-variables of the uppercase ones) that it feels like it should be its own separate PR. Does that make sense?

src/diffpy/snmf/snmf_class.py

sbillinge · 2025-06-06T21:01:50Z

src/diffpy/snmf/snmf_class.py

@@ -15,23 +27,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
        # Capture matrix dimensions
        self.N, self.M = MM.shape
        self.num_updates = 0
+        self.rng = np.random.default_rng(random_state)


can we have a more descriptive variable name? Is this a range? What is the range?

ping on this one.

self.rng is not a numerical range. It's an instance of NumPy's default_rng() (introduced in NumPy 1.17), which is used to generate reproducible pseudo-random numbers when seeded with an integer from random_state. The actual range is chosen when you use the rng object in the code. We could change it, but both rng and random_state are standard names for these, including in scikit-learn.
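To make the reproducibility point concrete, here is a minimal sketch. The names `k`, `m`, `a_first`, and `a_second` are illustrative stand-ins, not taken from the package; only `np.random.default_rng` and `Generator.normal` are real NumPy calls.

```python
import numpy as np

# Two generators seeded with the same integer produce identical streams.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

k, m = 2, 5  # hypothetical stand-ins for self.K and self.M

# Same form as the line under review: ones plus a small Gaussian perturbation.
a_first = np.ones((k, m)) + rng_a.normal(0, 1e-3, size=(k, m))
a_second = np.ones((k, m)) + rng_b.normal(0, 1e-3, size=(k, m))

same_seed_matches = np.array_equal(a_first, a_second)

# Passing None (the default) seeds from fresh OS entropy instead,
# so the perturbation is not reproducible from run to run.
rng_unseeded = np.random.default_rng(None)
a_unseeded = np.ones((k, m)) + rng_unseeded.normal(0, 1e-3, size=(k, m))
```

The "range" of the draws is set by the arguments to `normal` (mean 0, standard deviation 1e-3), not by the generator object itself.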

OK, let's keep the name and we can say how it is used in the docstring. Something like "The value used to initialize the random state in ..." where .... differentiates it from the docstring for random_state which is presumably the same but used somewhere different?

Ah, looking at the code, I see that it is generated from random_state. If this is never accessed by the user, just passed around internally, then we could make it private. We do that by giving it an underscore in front, so self._rng, and then we don't have to make a docstring for it. The users never access it, but it is available to internal functions.
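A small sketch of the private-attribute convention, assuming a toy class. `PerturbedOnes` and `draw` are hypothetical names made up for illustration and are not part of diffpy.snmf.

```python
import numpy as np


class PerturbedOnes:
    """Toy class (hypothetical, not SNMFOptimizer) with a private generator."""

    def __init__(self, shape, random_state=None):
        self.shape = shape
        self.random_state = random_state  # public: the user supplied it
        # The leading underscore marks the generator as internal by convention.
        self._rng = np.random.default_rng(random_state)

    def draw(self):
        # Internal methods can still use the private generator freely.
        return np.ones(self.shape) + self._rng.normal(0, 1e-3, size=self.shape)


instance = PerturbedOnes((2, 3), random_state=0)
# Only names without a leading underscore would need an Attributes entry.
public_names = [name for name in vars(instance) if not name.startswith("_")]
```

Python does not enforce the underscore, so `instance._rng` is still reachable; the prefix only signals that it is not part of the documented interface.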

sbillinge · 2025-06-06T21:02:32Z

src/diffpy/snmf/snmf_class.py

        if self.A is None:
-            self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3  # Small perturbation
+            self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))


K and M are probably good names if the matrix decomposition equation is in the docstring, so they get defined there.

I think you addressed this with your comment to MM, but as a general rule, please respond to each comment so the reviewer knows you have seen it. It wouldn't work here, but just thumbs up works if you have seen a comment and agree, but it saves time in the long run as I don't have to write this long comment...... :)

Got it. I'd like to put the matrix decomposition in the docstring, but I'm having trouble formatting it. Might have to ask about this in one of the meetings.

yes, I am not 100% sure but I think there is a way.

john-halloran · 2025-06-06T21:09:28Z

This is great!

We have to decide how much testing we will add. Ideal is 100% coverage, optimal is probably less.

Maybe write the docstrings so I can understand what the class does, then we can decide what to test?

Thanks, will work on resolving these. To be clear, for things like the docstrings would you prefer I make new PRs, get those merged, then rebase this one, or just add to this existing PR?

john-halloran · 2025-06-08T06:23:10Z

For now, I will assume anything given as feedback in this PR could be in scope to include.

sbillinge

This is a great start. I left a couple of comments.

sbillinge · 2025-06-08T17:23:57Z

src/diffpy/snmf/snmf_class.py

@@ -17,6 +17,64 @@ def __init__(
        components=None,
        random_state=None,
    ):
+        """Run sNMF based on an ndarray, parameters, and either a number


This is fantastic! Thanks for this. Please see here for our docstring standards, I am not sure if you looked at it:
https://scikit-package.github.io/scikit-package/programming-guides/billinge-group-standards.html#docstrings

For classes it is a bit tricky because what info do we put in the "Class" docstring and what info do we put in the "constructor" (i.e., the __init__()) docstring. After some googling we came up with the breakout that is shown in the DiffractionObjects class that is shown there. We would be after something similar here.

By way of example, I would probably do like this in this case

class SNMFOptimizer:
    '''Configuration and methods to run the stretched NMF algorithm, sNMF

    Instantiating the SNMFOptimizer class runs all the analysis immediately.
    The results can then be accessed as instance attributes of the class
    (X, Y, and A).

    Please see <reference to paper> for more details about the algorithm.

    Attributes
    ----------
    mm : ndarray
        The array containing the data to be decomposed. Shape is
        (length_of_signal, number_of_conditions)
    y0 : ndarray
        The array containing initial guesses for the component weights at
        each stretching condition. Shape is (number_of_components,
        number_of_conditions)
    ...
    '''

put future development plans into issues, not in the docstring. Just describe the current behavior. Try and keep it brief but highly informational.

To conform to PEP8 standards I lower-cased the variables. I know they correspond to matrices but we should decide which standard to break. The tie-breaker should probably be scikit-learn. Whatever they do, let's do that. Let's also add a small comment (not in the docstring) to remind ourselves in the future if it breaks PEP8 or it will annoy me every time we revisit it and I will try and change it back......

Conditions on instantiation will go in the constructor docstring.

That one describes the init method so should look more like a function docstring. It would look something like....

def __init__(mm....)
    '''Initialize a SNMFOptimizer instance and run the optimization

    Parameters
    ----------
    mm : ndarray
        The array containing the data to be decomposed. Shape is
        (length_of_signal, number_of_conditions)
    y0 : ndarray
        Optional. Defaults to None. The array containing initial guesses
        for the component weights at each stretching condition. Shape is
        (number_of_components, number_of_conditions)
    ...

I think there was some text before about how Y0 was required. But if it is required it may be better to make it a required (positional) variable in the constructor and not have it optional. we can discuss design decisions too if you like.

Either Y0 or n_components needs to be provided. Currently, Y0.shape overrides n_components if both are provided, and an error is thrown if neither is provided. The way scikit-learn does it is a little more flexible and also allows for an n_components that differs from Y0.shape, although I'm not clear on why you'd want that. But I'm not matching their behavior exactly because the current code doesn't allow that.

scikit-learn actually does break PEP8 to upper-case the matrices

sbillinge

good progress, please see comments.

sbillinge · 2025-06-10T01:24:47Z

src/diffpy/snmf/snmf_class.py

@@ -4,6 +4,18 @@


 class SNMFOptimizer:
+    """A self-contained implementation of the stretched NMF algorithm (sNMF),


This is too long. Needs to be < 80 characters, followed by a blank line.

sbillinge · 2025-06-10T01:25:23Z

src/diffpy/snmf/snmf_class.py

+
+    For more information on sNMF, please reference:
+    Gu, R., Rakita, Y., Lan, L. et al. Stretched non-negative matrix factorization.
+    npj Comput Mater 10, 193 (2024). https://doi.org/10.1038/s41524-024-01377-5


we would normally do a list of Class attributes here. Everything that is self.something. This is obviously strongly overlapped with the arguments of the constructor, as many of the attributes get defined in the constructor, but logically they are different. Here we list and describe the class attributes, there we describe the init function arguments.

I'm not clear on how I'd distinguish the arguments from the attributes. I understand how they are different semantically, but what part of that is necessary to make clear here? Can you give an example? Those have been helpful.

everything that is self.something (except for methods which are self.functions() which are not considered attributes) is an attribute. So MM, Y0, X0 are attributes, but also M, N, rng, num_updates etc.

Inside a function or method the parameters are the arguments of the function, so for the __init__() function they will be MM, Y0, X0, A, rho, eta and so on. Some of the descriptions will overlap but for the function argument the user needs to know if it is optional or not, what the default is, and anything else they need to know to successfully instantiate the class. People will generally not see the two docstrings at the same time, so there can be some repetition, but try and keep it short but informative.
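As a compact illustration of the split described above, here is a sketch on a toy class. `Centered`, `copy`, `n_rows`, and `means` are hypothetical names chosen for the example. Note that `copy` is a parameter but never becomes an attribute, while `n_rows` and `means` are attributes that are never passed in.

```python
import numpy as np


class Centered:
    """Store a data array together with its column means.

    Attributes
    ----------
    data : ndarray
        The array passed in at instantiation. Shape is (n_rows, n_cols).
    n_rows : int
        The number of rows in data.
    means : ndarray
        The mean of each column of data. Shape is (n_cols,).
    """

    def __init__(self, data, copy=True):
        """Initialize a Centered instance and compute the column means.

        Parameters
        ----------
        data : ndarray
            The array to store. Shape is (n_rows, n_cols).
        copy : bool
            Optional. Defaults to True. Whether to store a copy of data
            rather than a reference to it.
        """
        self.data = data.copy() if copy else data
        self.n_rows = self.data.shape[0]
        self.means = self.data.mean(axis=0)


example = Centered(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

The class docstring answers "what can I read off an instance?", and the constructor docstring answers "what do I pass in, and what is optional?".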

sbillinge · 2025-06-10T01:26:27Z

src/diffpy/snmf/snmf_class.py

-        of the class (X, Y, and A). Eventually, this will be changed such
-        that __init__ only prepares for the optimization, which will can then
-        be done using fit_transform.
+        """Initialize an instance of SNMF and run the optimization

        Parameters
        ----------
        MM: ndarray


these need a space before the colon (not sure why we adopted that standard, but we did). So mm : ndarray

sbillinge · 2025-06-10T01:28:49Z

src/diffpy/snmf/snmf_class.py

-            provided.
+            The array containing initial guesses for the component weights
+            at each stretching condition. Shape is (number of components, number of
+            conditions) Must be provided if n_components is not provided. Will override


normally we would raise an exception if two conflicting things are provided (we don't want to guess which is the right one) unless there is a good functional reason to do it another way. We like to avoid "magic" and the current behavior of the code could be "magic". Please raise an exception unless there is a strong reason to do otherwise.
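One way the "no magic" rule could look in practice is sketched below. `resolve_n_components` is a hypothetical helper, not a function in the package, and accepting both inputs when they agree is an assumption of this sketch rather than something decided in the review.

```python
import numpy as np


def resolve_n_components(y0=None, n_components=None):
    """Return the number of components, refusing to guess on conflict."""
    if y0 is None and n_components is None:
        raise ValueError("Provide either y0 or n_components.")
    if (
        y0 is not None
        and n_components is not None
        and y0.shape[0] != n_components
    ):
        # Conflicting inputs: raise instead of silently picking one.
        raise ValueError(
            f"Conflicting inputs: y0 has {y0.shape[0]} components "
            f"but n_components={n_components}."
        )
    # y0 has shape (number_of_components, number_of_conditions).
    return y0.shape[0] if y0 is not None else n_components


from_y0 = resolve_n_components(y0=np.ones((3, 4)))
from_int = resolve_n_components(n_components=2)
```

With this shape, the docstring no longer needs a "will override" sentence, because there is no precedence rule left to explain.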

src/diffpy/snmf/snmf_class.py

sbillinge · 2025-06-10T01:40:45Z

src/diffpy/snmf/snmf_class.py

+            without terminating the optimization. Note that a minimum of 20 updates
+            are run before this parameter is checked.
+        n_components: int
+            The number of components to attempt to extract from MM. Note that this will


attempt? So sometimes it extracts fewer than n_components when it attempts but doesn't manage?

It should never find less. "Attempt" means that sometimes the optimization may not work. But if this is unclear I can change it.

yes, delete "attempt" to make it clearer.

ping on this one?

sbillinge · 2025-06-10T01:42:42Z

src/diffpy/snmf/snmf_class.py

+            be overridden by Y0 if that is provided, but must be provided if no Y0 is
+            provided.
+        random_state: int
+            The integer which acts as a reproducible seed for the initial matrices used in


"The random seed used to initialize". I think the second sentence is useful information, but I think everyone will know what this is. btw, let's cross-check if you didn't already so we are using the names for common variables as scikit-learn.

I removed the second sentence, which I think is what you mean here. And yes, random_state is the name in scikit-learn.

Can we also change "The integer...." to "The random seed used to initialize...."

sbillinge · 2025-06-10T01:43:07Z

src/diffpy/snmf/snmf_class.py

@@ -15,23 +27,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
        # Capture matrix dimensions
        self.N, self.M = MM.shape
        self.num_updates = 0
+        self.rng = np.random.default_rng(random_state)


ping on this one.

sbillinge · 2025-06-10T01:45:53Z

src/diffpy/snmf/snmf_class.py

        if self.A is None:
-            self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3  # Small perturbation
+            self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))


I think you addressed this with your comment to MM, but as a general rule, please respond to each comment so the reviewer knows you have seen it. It wouldn't work here, but just thumbs up works if you have seen a comment and agree, but it saves time in the long run as I don't have to write this long comment...... :)

sbillinge

good discussion. pls see my comments

sbillinge · 2025-06-13T00:48:02Z

src/diffpy/snmf/snmf_class.py

+            The stretching factor that influences the decomposition. Zero corresponds to no
+            stretching present. Relatively insensitive and typically adjusted in powers of 10.
+        eta : float
+            The sparsity factor than influences the decomposition. Should be set to zero for


typo than -> that

Also it might help to know a good value or range of values to choose when not setting it to zero?

Added the suggested adjustment factor, but the good range is honestly not clear yet across a broad set of data. Will add once it is.

sbillinge · 2025-06-13T01:02:09Z

src/diffpy/snmf/snmf_class.py

+
+    For more information on sNMF, please reference:
+    Gu, R., Rakita, Y., Lan, L. et al. Stretched non-negative matrix factorization.
+    npj Comput Mater 10, 193 (2024). https://doi.org/10.1038/s41524-024-01377-5


everything that is self.something (except for methods which are self.functions() which are not considered attributes) is an attribute. So MM, Y0, X0 are attributes, but also M, N, rng, num_updates etc.

Inside a function or method the parameters are the arguments of the function, so for the __init__() function they will be MM, Y0, X0, A, rho, eta and so on. Some of the descriptions will overlap but for the function argument the user needs to know if it is optional or not, what the default is, and anything else they need to know to successfully instantiate the class. People will generally not see the two docstrings at the same time, so there can be some repetition, but try and keep it short but informative.

sbillinge · 2025-06-13T01:03:31Z

src/diffpy/snmf/snmf_class.py

+        MM,
+        Y0=None,
+        X0=None,
+        A=None,


more readable code is always better, so lower-case descriptive is preferred by me. I don't actually like that scikit-learn breaks this. Shall we go with lower-case? Names can be short if they are defined in a function in the docstring and docs too. Just the code benefits from being readable, so I would say use your judgement on that.

sbillinge · 2025-06-13T01:10:24Z

src/diffpy/snmf/snmf_class.py

        if self.A is None:
-            self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3  # Small perturbation
+            self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))


yes, I am not 100% sure but I think there is a way.

sbillinge · 2025-06-14T10:41:28Z

Thanks, will work on resolving these. To be clear, for things like the docstrings would you prefer I make new PRs, get those merged, then rebase this one, or just add to this existing PR?

Sounds good. Yes, in general, smaller PRs are easier to merge. We never rebase. You may have been using that term loosely, but it has a particular meaning. But yes everything will get merged together.

sbillinge

I have to run for a plane. I will make another review later, but this is getting very close now.

sbillinge · 2025-06-14T10:44:27Z

src/diffpy/snmf/snmf_class.py

+            without terminating the optimization. Note that a minimum of 20 updates
+            are run before this parameter is checked.
+        n_components: int
+            The number of components to attempt to extract from MM. Note that this will


ping on this one?

sbillinge · 2025-06-14T10:44:51Z

src/diffpy/snmf/snmf_class.py

+            be overridden by Y0 if that is provided, but must be provided if no Y0 is
+            provided.
+        random_state : int
+            The seed for the initial matrices used in the optimization.


this is a little unclear. What matrices?

sbillinge · 2025-06-14T10:47:07Z

src/diffpy/snmf/snmf_class.py

@@ -15,23 +27,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
        # Capture matrix dimensions
        self.N, self.M = MM.shape
        self.num_updates = 0
+        self.rng = np.random.default_rng(random_state)


Ah, looking at the code, I see that it is generated from random_state. If this is never accessed by the user, just passed around internally, then we could make it private. We do that by giving it an underscore in front, so self._rng, and then we don't have to make a docstring for it. The users never access it, but it is available to internal functions.

sbillinge

The docstrings are really good now, modulo a few small comments.

I want to understand why the SNMFOptimizer is a class and not a function (or rather a method in the SNMF class). Let's figure this out. It won't create that much work...less actually because we will need fewer docstrings.....so don't panic about wasted effort.

sbillinge · 2025-06-14T11:27:57Z

src/diffpy/snmf/main.py

-
-
-my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, components=2)
+my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, n_components=2)


Looking at this, it seems that we have already instantiated some kind of SNMF class, then this is doing the optimization. Is there a particular reason why we make this a class and not a function? It feels much more like a function to me. Could you think what the downsides are of making it a function? Is scikit-learn doing something like this too?
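For reference, scikit-learn exposes both forms: the `sklearn.decomposition.NMF` estimator class and the `sklearn.decomposition.non_negative_factorization` function. The sketch below shows how the two shapes can coexist. It uses plain NMF with multiplicative updates as a stand-in, NOT the stretched algorithm, and `toy_nmf` and `ToyNMF` are hypothetical names.

```python
import numpy as np


def toy_nmf(x, n_components, random_state=None, max_iter=200):
    """Plain NMF by multiplicative updates (not the stretched algorithm)."""
    rng = np.random.default_rng(random_state)
    n_rows, n_cols = x.shape
    # Strictly positive starting guesses for both factors.
    w = rng.random((n_rows, n_components)) + 1e-3
    h = rng.random((n_components, n_cols)) + 1e-3
    eps = 1e-12  # guards against division by zero
    for _ in range(max_iter):
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h


class ToyNMF:
    """Thin class wrapper: configuration in __init__, work in fit()."""

    def __init__(self, n_components, random_state=None, max_iter=200):
        self.n_components = n_components
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, x):
        # Results are stored as attributes instead of being returned.
        self.w_, self.h_ = toy_nmf(
            x, self.n_components, self.random_state, self.max_iter
        )
        return self


data = np.random.default_rng(0).random((6, 4))
w_func, h_func = toy_nmf(data, n_components=2, random_state=1)
model = ToyNMF(n_components=2, random_state=1).fit(data)
```

The function form needs one docstring and returns its results directly. The class form keeps the settings and the results together on one object, at the cost of an extra docstring and the attribute bookkeeping discussed earlier in this review.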

feat: Add random state feature.

d8d4e11

sbillinge reviewed Jun 6, 2025

Add class docstring

ae45726

components->n_components

3d7c8b6

sbillinge reviewed Jun 8, 2025

Updated docstring

a0483b4

sbillinge reviewed Jun 10, 2025

Shorten and reformat docstring

d39cbe0

sbillinge reviewed Jun 13, 2025

docstring typo

c783a02

sbillinge reviewed Jun 14, 2025
